[57] ABSTRACT

A system for data compression utilizing a systolic array
architecture for Vector Quantization (VQ) is disclosed for both
full-searched and tree-searched VQ. For a tree-searched VQ, the
special case of a Binary Tree-Searched VQ (BTSVQ) is disclosed
with identical Processing Elements (PEs) in the array for both a
Raw-Codebook VQ (RCVQ) and a Difference-Codebook VQ (DCVQ)
algorithm. A fault-tolerant system is disclosed which allows a PE
that has developed a fault to be bypassed in the array and replaced
by a spare at the end of the array, with codebook memory assignment
shifted one PE past the faulty PE of the array.

4 Claims, 12 Drawing Sheets 
PIPELINE SYNTHETIC APERTURE RADAR
DATA COMPRESSION UTILIZING
SYSTOLIC BINARY TREE-SEARCHED
ARCHITECTURE FOR VECTOR
QUANTIZATION

ORIGIN OF THE INVENTION 

The invention described herein was made in the performance of
work under a NASA contract, and is subject to the provisions of
Public Law 96-517 (35 USC 202) in which the Contractor has elected
not to retain title. This is a continuation of application Ser. No.
07/550,775 filed Jul. 10, 1990, now abandoned.

TECHNICAL FIELD

The invention relates to a systolic array architecture for a
Vector Quantizer (VQ) for real-time compression of data to
reduce the data communication and/or archive costs, and
particularly to a tree-searched VQ.

BACKGROUND ART 

Efficient data compression to reduce the data volume
significantly decreases both data communication and archive
costs. Among existing data compression algorithms, Vector
Quantization (VQ) has been demonstrated to be an effective method
capable of producing good reconstructed data quality at high
compression ratios. The primary advantage of the VQ algorithm, as
compared to other high compression ratio algorithms such as the
adaptive transform coding algorithm, is its extremely simple
decoding procedure, which makes it a promising technique for
single-encoder, multiple-decoder data compression systems.

The VQ algorithm has been selected as the data compression
algorithm to be used for rapid electronic transfer of browse image
data from an on-line archive system to end users of the Alaska SAR
Facility (ASF) and the Shuttle Imaging Radar C (SIR-C) ground data
systems. For this on-line archive application, VQ is required to
reduce the volume of browse image data by a factor of 15 to 1 so
that the data can be rapidly transferred through the Space Physics
Analysis Network (SPAN) having a 9600 bits per second data rate,
and be accurately reconstructed at the sites of scientific users.

Another application of the VQ algorithm is the real-time
downlink of the Earth Observing System (EOS) on-board processor
data to the ground data users. For this data downlink application,
VQ is required to reduce the volume of image data produced by the
on-board processor by a factor of 7 to 1 so that the data can be
transferred in real time through the direct downlink channel, which
is limited to a data rate of 1 megabit per second. These flight
projects are currently undertaken by the National Aeronautics and
Space Administration (NASA) for imaging and monitoring of global
environmental changes.

Aside from these space applications, VQ can also be
applied to a broad area in commercial industry for data
communication and archival applications, such as digital
speech coding over telephone lines, High Definition TV
(HDTV) video image coding and medical image coding.

Vector quantization is a generalization of scalar quantization.
In vector quantization, the input data is divided into many small
data blocks (i.e., data vectors). The quantization levels (i.e.,
codevectors) are vectors of the same dimension as the input data
vectors. A general functional block diagram for vector quantization
is shown in FIG. 1. A codebook 10 comprised of codevectors
$C_0, C_1, \ldots, C_{N-1}$ is used at the transmit end of a
communication channel 11 for data encoding and a duplicate codebook
10' is used at the receive end for data decoding. An encoding
functional block 12 carries out the algorithm indicated by

$$i^{[k]} = \min_{0 \le i \le N-1}^{-1} D(x^{[k]}, C_i), \qquad (1)$$

where: $x^{[k]}$ represents the input data vector at time k; $C_i$ is
the codevector; $D(x^{[k]}, C_i)$ is the distortion function; N the
total number of codevectors; and $i^{[k]}$ the optimal codevector
index. The procedure defined by that equation is to select the
stored codevector which yields the minimum distortion between an
input data vector $x^{[k]}$ and the stored codevectors
$C_0, C_1, \ldots, C_{N-1}$. The optimal index $i^{[k]}$ transmitted
through the channel 11 is used at the receive end for the decoding
function in block 13 carried out by using the index $i^{[k]}$ to
look up the codevector $C_{i^{[k]}}$ in the codebook 10' that is then
used as the reconstructed data vector $\hat{x}^{[k]}$, which closely
approximates the original data vector $x^{[k]}$. The decoding
procedure can be expressed as

$$\hat{x}^{[k]} = C_{i^{[k]}} \qquad (2)$$

which is a table look-up procedure. Data compression is
achieved since fewer bits are needed to represent the
codevector indices than the input data vectors.

The codebook is generated by training on a subset of the
source data. The performance of the codebook is highly
dependent on the similarity between the training data and the
coded data. It then follows that the encoding procedure need
only involve computing the distortion between each input
data vector and all of the stored codevectors to select the best
match. This algorithm is known as the full-searched VQ
algorithm.
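
As an illustration of Equations (1) and (2), the following minimal Python sketch encodes a vector by exhaustive search and decodes it by table look-up. The squared-error distortion, the function names and the toy codebook values are assumptions made for this example only and are not taken from the disclosure.

```python
def squared_error(x, c):
    """Assumed distortion D(x, c): squared error between equal-length vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def full_search_encode(x, codebook):
    """Equation (1): index of the codevector with minimum distortion."""
    best_index, best_distortion = 0, squared_error(x, codebook[0])
    for i in range(1, len(codebook)):
        d = squared_error(x, codebook[i])
        if d < best_distortion:  # strict '<' keeps the lowest index on ties
            best_index, best_distortion = i, d
    return best_index


def decode(index, codebook):
    """Equation (2): decoding is a table look-up."""
    return codebook[index]


# Toy codebook: N = 4 codevectors of dimension m = 2 (illustrative values).
codebook = [(0, 0), (0, 10), (10, 0), (10, 10)]
index = full_search_encode((9, 2), codebook)
reconstructed = decode(index, codebook)
```

Only `index` has to cross the channel; with N = 4 codevectors that is 2 bits per input vector, regardless of how many bits the vector itself occupies.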

The major drawback of the full-searched VQ algorithm is
the high complexity involved in drawing up (training) the
codebook and then data encoding, which poses a great
challenge for real-time application. To reduce the encoding
complexity, the tree-searched VQ algorithm is employed
such that the complexity only grows linearly rather than
exponentially as the codebook size increases. For the
tree-searched VQ, the codebook is divided into several tree
levels, as illustrated in FIG. 2 for a 2-level tree-structured
codebook. In the encoding process, the input data vector $x^{[k]}$
is first compared with the first level codebook
$C_0, C_1, \ldots, C_{N_1-1}$. Based on the selected codevector, the
input data vector $x^{[k]}$ is then compared with the codevectors of
the corresponding second level subcodebook
$C_{0,0}, C_{0,1}, \ldots, C_{N_1-1,N_2-1}$. This
encoding procedure is repeated until the input data vector is
compared with the last level subcodebook. The best matched
codevector at the last level subcodebook is then used to
represent this input data vector.
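
The level-by-level search described above can be sketched as follows. This is a minimal illustration, assuming a squared-error distortion and a hypothetical data layout in which each level is a dictionary keyed by the tuple of indices selected at the previous levels; neither the layout nor the toy values come from the disclosure.

```python
def squared_error(x, c):
    """Assumed distortion: squared error between equal-length vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def tree_search_encode(x, level_codebooks):
    """Search one subcodebook per level; return the tuple of per-level indices.

    level_codebooks[l] maps the indices chosen at the previous levels
    (a tuple) to the list of codevectors to be searched at that level.
    """
    path = ()
    for subcodebooks in level_codebooks:
        candidates = subcodebooks[path]  # subcodebook selected by earlier levels
        best = min(range(len(candidates)),
                   key=lambda i: squared_error(x, candidates[i]))
        path += (best,)
    return path


# Toy 2-level tree with N1 = N2 = 2 and m = 2 (illustrative values).
levels = [
    {(): [(2, 2), (8, 8)]},
    {(0,): [(1, 1), (3, 3)], (1,): [(7, 7), (9, 9)]},
]
path = tree_search_encode((8, 9), levels)
```

In this toy case the tree search evaluates 2 + 2 distortions where a full search over the four leaf codevectors would evaluate 4; the saving grows with the codebook, since each level only visits one subcodebook.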

STATEMENT OF THE INVENTION 

An objective of this invention for real-time data compression
is to employ a systolic process in the VQ encoding
procedure by taking advantage of the regular data flow
pattern inherent in the VQ algorithm, particularly with a
tree-searched codebook. By a combination of tree-searched
VQ and systolic processing, a high throughput data compressor
can be realized at a low hardware cost to meet the
real-time rate requirement. This is the main theme of this
invention. Thus, the primary objective of this invention is to
provide a data compression system that can achieve a
real-time encoding rate with small hardware cost utilizing
systolic array architecture for a tree-searched VQ algorithm.

The systolic array consists of a network of identical
Processing Elements (PE) that rhythmically process and
pass data among themselves. It exploits design principles
such as modularity, regular data flow, simple connectivity
structure, localized communication, simple global control
and parallel/pipeline processing functions. The systolic
array is an effective architecture for implementation of
matrix type computation.

This invention applies the systolic array architecture to
both full-searched and tree-searched VQ. Briefly, the encoding
procedure of a full-searched VQ can be formulated as a
matrix-vector computation in a general form, where the
multiply operator represents the scalar distortion computation
and the add operator represents the summation of
weighted scalar distortions, while the encoding procedure of
a tree-searched VQ can be formulated as a series of
matrix-vector computations with proper access to codevectors in
the subcodebooks. Examples are specifically given for a Binary
Tree-Searched VQ (BTSVQ) of both a raw codebook and a
difference codebook referred to hereinafter as RCVQ and
DCVQ, respectively.

A secondary objective of this invention is to provide a 
fault tolerant systolic VQ encoder by including a spare 
Processing Element (PE) in a systolic array of PEs and a 
means for detection and replacement of a faulty PE with the 
spare PE to enhance the system reliability. 

The novel features that are considered characteristic of
this invention are set forth with particularity in the appended 
claims. The invention will best be understood from the 
following description when read in connection with the 
accompanying drawings. 


BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a generalized functional block diagram of the
prior-art vector quantization (VQ) algorithm.

FIG. 2 is a diagram of the prior-art encoding procedure of
the 2-level tree-searched vector quantization algorithm.

FIG. 3 illustrates a block diagram of a systolic full-search
vector quantizer.

FIG. 4 illustrates a systolic architecture for a Binary
Tree-Search Vector Quantizer (BTSVQ).

FIG. 5 illustrates major functional blocks of a systolic
binary tree-searched VQ encoder as applied to an EOS on-board
SAR processor.

FIG. 6 illustrates a functional block diagram of a BTSVQ
Processing Element (PE) in the system of FIG. 5 for RCVQ.

FIGS. 7a and 7b together illustrate a detailed functional
design of the memory bank shown in FIG. 5.

FIG. 8 illustrates a major functional block diagram for a
systolic vector quantizer in which each vector quantization
processing element has its own codebook memory.

FIG. 9 illustrates the distortion computing data path of the
processing elements in the system of FIG. 8.

FIG. 10 illustrates a preferred implementation for the
processing elements of FIG. 9.

FIG. 11 illustrates fault tolerance augmentation of a
systolic vector quantization array using a spare processing
element and dynamic reconfiguration switches for replacing
a processing element when it is found to have a fault.
DETAILED DESCRIPTION OF THE
INVENTION

As noted hereinbefore, Vector Quantization (VQ) is 
essentially a generalization of scalar quantization. For input 
image data, the stream of input pixels is divided into vectors 
(small blocks of pixels, e.g., 4x4 pixel blocks) and for a 
full-searched VQ, each input data vector is compared with 
every vector stored in a codebook. The index of the code- 
book vector of the smallest distortion is chosen as the 
encoded quantization vector to be transmitted. To reduce the 
encoding complexity, the tree-searched VQ technique is 
employed. This technique divides the codebook into levels 
of subcodebooks of a tree structure as illustrated in the 
background art section. The input data vector is successively 
compared with the stored codevectors in the subcodebook 
levels, i.e., 

$$
\begin{aligned}
i_1^{[k]} &= \min_{0 \le i_1 \le N_1-1}^{-1} D(x^{[k]}, C_{i_1}) \\
i_2^{[k]} &= \min_{0 \le i_2 \le N_2-1}^{-1} D(x^{[k]}, C_{i_1^{[k]} i_2}) \\
&\;\;\vdots \\
i_L^{[k]} &= \min_{0 \le i_L \le N_L-1}^{-1} D(x^{[k]}, C_{i_1^{[k]} i_2^{[k]} \ldots i_{L-1}^{[k]} i_L})
\end{aligned}
\qquad (3)
$$

where $x^{[k]}$ is the input data vector sequence, k represents the
time index, and the codevector notation is: $C_{i_1}$ for level 1;
$C_{i_1 i_2}$ for level 2; and so forth with $C_{i_1 i_2 \ldots i_L}$
for level L. The distortion function is $D(x^{[k]}, C_i)$ and the
output coded data sequence is $i^{[k]}$. The number of bits per input
pixel is K and the input vector dimension is m pixels. Decoding is
still a table look-up procedure,

$$\hat{x}^{[k]} = C_{i^{[k]}} = C_{i_1^{[k]} i_2^{[k]} \ldots i_L^{[k]}} \qquad (4)$$

The compression ratio is Km/n for a fixed codebook
scheme. The codebook memory size is

$$(2^{n_1} + 2^{n_1+n_2} + \ldots + 2^{n_1 + \ldots + n_L})\,mK \text{ bits,}$$

where $n_i$ represents the subcodevector bit length at level i,
$1 \le i \le L$, and $N_i = 2^{n_i}$ represents the number of
codevectors. The encoding complexity is

$$2^{n_1} + 2^{n_2} + \ldots + 2^{n_L}$$

operations per pixel. Compression ratios are more easily 
controlled by adjusting m (vector dimension) since the 
variation in n (codebook bit-length) significantly affects the 
codebook size and the encoding complexity. 
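
The three figures of merit above can be evaluated numerically. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the codebook bit length n is the sum of the per-level bit lengths, which the text implies but does not state.

```python
def tree_vq_figures(level_bits, m, K):
    """Return (compression ratio, codebook memory in bits, operations per pixel).

    level_bits -- per-level bit lengths [n_1, ..., n_L]
    m          -- vector dimension in pixels
    K          -- bits per input pixel
    Assumption: n = n_1 + ... + n_L.
    """
    n = sum(level_bits)
    ratio = K * m / n                      # Km/n
    memory_words = 0
    cumulative_bits = 0
    for n_l in level_bits:
        cumulative_bits += n_l
        memory_words += 2 ** cumulative_bits   # 2^(n_1 + ... + n_l)
    memory_bits = memory_words * m * K
    operations = sum(2 ** n_l for n_l in level_bits)
    return ratio, memory_bits, operations


# Three ways to organize a 10-bit codebook for 4x4 vectors of 8-bit pixels.
full_search = tree_vq_figures([10], m=16, K=8)
two_level = tree_vq_figures([5, 5], m=16, K=8)
binary_tree = tree_vq_figures([1] * 10, m=16, K=8)
```

Under these assumptions all three organizations give the same 12.8:1 ratio, while the operation count falls from 1024 (single level) to 64 (two levels) to 20 (binary tree), at the price of roughly doubling the codebook memory in the binary case.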

The Binary Tree-Searched VQ (BTSVQ) is a special case 
of tree-searched VQ. For the BTSVQ, the number L of 
tree-levels is equal to the codebook bit length (n). The 
encoding of the BTSVQ can be expressed as 

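$$
\begin{aligned}
i_1^{[k]} &= \min^{-1} D(x^{[k]}, C_{i_1}), & i_1 &= 0, 1 \\
i_2^{[k]} &= \min^{-1} D(x^{[k]}, C_{i_1^{[k]} i_2}), & i_2 &= 0, 1 \\
&\;\;\vdots \\
i_n^{[k]} &= \min^{-1} D(x^{[k]}, C_{i_1^{[k]} \ldots i_{n-1}^{[k]} i_n}), & i_n &= 0, 1
\end{aligned}
\qquad (5)
$$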

For an RCVQ, namely a raw-codebook BTSVQ, the
distortion computation between the input vector $x^{[k]}$ and the
codevectors at the same binary tree level ($C_0$ and $C_1$) is,

$$
\begin{aligned}
D(x^{[k]}, C_0) &= \sum_{j=0}^{m-1} \left( x^{[k]}(j)^2 + C_0(j)^2 \right) - 2 \sum_{j=0}^{m-1} x^{[k]}(j)\, C_0(j) \\
D(x^{[k]}, C_1) &= \sum_{j=0}^{m-1} \left( x^{[k]}(j)^2 + C_1(j)^2 \right) - 2 \sum_{j=0}^{m-1} x^{[k]}(j)\, C_1(j)
\end{aligned}
\qquad (6)
$$

The codebook memory size is $(2^{n+1} - 2)\,mK$ bits. The
encoding complexity is 2n operations per pixel.

For a DCVQ, namely a difference-codebook BTSVQ, the
distortion computation between the input vector $x^{[k]}$ and the
codevectors at the same binary tree level ($C_0$ and $C_1$) is
simplified as follows

$$
\begin{aligned}
[D(x, C_0) - D(x, C_1)]/2 &= \sum_{j=0}^{m-1} \left( C_0(j)^2 - C_1(j)^2 \right)/2 - \sum_{j=0}^{m-1} x^{[k]}(j)\,[C_0(j) - C_1(j)] \\
&= \Delta - \sum_{j=0}^{m-1} x^{[k]}(j)\,\delta(j)
\end{aligned}
\qquad (7)
$$

Instead of saving $C_0(j)$ and $C_1(j)$, the terms

$$\Delta = \sum_{j=0}^{m-1} \left( C_0(j)^2 - C_1(j)^2 \right)/2$$

and $\delta(j) = C_0(j) - C_1(j)$ are stored in the subcodebook. The
codebook memory size is $(2^n - 1)\,[m(K+1) + (2K + \log m)]$
bits. The encoding complexity is n operations per pixel.

The DCVQ is an improved version of the RCVQ. For the
DCVQ, the encoding and hardware complexity is reduced
to half of that of the RCVQ. This is a unique characteristic
for a BTSVQ.
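
The equivalence between the raw-codebook and difference-codebook decisions can be checked numerically. In the sketch below the function names and the toy codevector pair are hypothetical, squared error is assumed as the distortion, and ties are resolved toward index 0 in both functions, an arbitrary choice made for the sketch.

```python
def rcvq_bit(x, c0, c1):
    """Raw-codebook decision: compute both distortions and compare them."""
    d0 = sum((xj - cj) ** 2 for xj, cj in zip(x, c0))
    d1 = sum((xj - cj) ** 2 for xj, cj in zip(x, c1))
    return 0 if d0 <= d1 else 1


def difference_entry(c0, c1):
    """Precompute the stored terms: threshold Delta and difference vector delta."""
    big_delta = sum(a * a - b * b for a, b in zip(c0, c1)) / 2
    small_delta = [a - b for a, b in zip(c0, c1)]
    return big_delta, small_delta


def dcvq_bit(x, big_delta, small_delta):
    """Difference-codebook decision: one inner product and one comparison."""
    inner = sum(xj * dj for xj, dj in zip(x, small_delta))
    # Delta - inner equals [D(x, C0) - D(x, C1)] / 2
    return 0 if big_delta - inner <= 0 else 1


c0, c1 = (1, 2, 3, 4), (4, 3, 2, 1)   # toy codevector pair, m = 4
stored = difference_entry(c0, c1)
samples = [(1, 2, 3, 4), (4, 3, 2, 1), (0, 0, 0, 0), (3, 3, 1, 0), (2, 2, 2, 2)]
bits_raw = [rcvq_bit(x, c0, c1) for x in samples]
bits_diff = [dcvq_bit(x, *stored) for x in samples]
```

Both lists agree, but `dcvq_bit` performs m multiply-accumulate steps per tree level where `rcvq_bit` performs 2m, which is the halving of encoding complexity noted above.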

Systolic Architecture for the Full-Searched VQ 
For most distortion measures, such as the weighted mean 
square error, the vector distortion can be shown as the 
weighted sum of the scalar distortion, i.e., 

$$d(i) = D(x, C_i) = \sum_{j=0}^{m-1} w(j)\, D(x(j), C_i(j)), \qquad (9)$$

for $0 \le i \le N-1$ and $0 \le j \le m-1$, where x(j) represents the
$j$th component of the input data vector, $C_i(j)$ the $j$th component
of the $i$th codevector, w(j) the weighting factor in the
distortion measure, and d(i) the distortion between x and $C_i$.
The index of the codevector of the minimum distortion
represents the coded data of the input data vector, i.e.,

$$i = \min_{0 \le i \le N-1}^{-1} d(i). \qquad (10)$$

For this class of distortion measure, the encoding proce- 
dure of the full-searched VQ shown in Equation (1) can be 
expressed in a general matrix-vector multiplication form, 
where the multiply operator represents the evaluation of 
scalar distortion and the add operator is the summation of the 
weighted scalar distortions. Therefore, Equation (9) can be
systolically processed, since matrix-type computations are well
suited for systolic processing.

A systolic architecture for the full-searched VQ may thus
be an array of processors 0, 1, . . . , N-1 and codebooks
0, 1, . . . , N-1, each codebook i having a stored codevector
comprised of m components $C_i(0), C_i(1), \ldots, C_i(m-1)$, as
shown in FIG. 3. The distortion parameter, d(i), is associated
with processor i where the distortion is computed, for
$0 \le i \le N-1$. The parameter d(i) accumulates the intermediate
result as the codevector component $C_i(j)$ moves downward
and the input data x(j) moves to the right synchronously.
After m clock cycles, d(i) will consecutively contain the
distortion between the input data vector and the $i$th
codevector. To perform Equation (10), two variables, I and D, are
required to record the index and distortion of the codevector
of the current minimum distortion. The variable D is
initialized to be a large number. Both I and D enter processor
0 when d(0) is determined. They move down the array one
processor per clock cycle. At processor i, D is compared
with d(i). If d(i)<D, then I=i and D=d(i). As they flow out of
processor N-1, I will contain the codevector index of the
minimum distortion, representing the coded data.
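
The data flow just described can be mimicked clock by clock in software. The sketch below is a simplified model under stated assumptions: the scalar distortion is taken to be weighted squared error, x(j) is assumed to reach processor i at clock i + j, and the travelling (I, D) pair is assumed to visit processor i one clock after d(i) is complete. The function name is hypothetical and register-level details of FIG. 3 are not modeled.

```python
def systolic_full_search(x, codebook, weights=None):
    """Simplified clock-by-clock model of a linear array of N processors."""
    m, N = len(x), len(codebook)
    w = weights if weights is not None else [1] * m
    d = [0] * N                    # d(i): accumulator inside processor i
    I, D = None, float("inf")      # travelling pair; D starts as a large number
    for clock in range(N + m):
        for i in range(N):
            j = clock - i          # component of x present at processor i
            if 0 <= j < m:
                # assumed scalar distortion: weighted squared error
                d[i] += w[j] * (x[j] - codebook[i][j]) ** 2
            elif j == m:
                # d(i) is complete and the (I, D) pair has just arrived
                if d[i] < D:
                    I, D = i, d[i]
    return I, D


codebook = [(0, 0), (0, 10), (10, 0), (10, 10)]   # toy values, N = 4, m = 2
result = systolic_full_search((9, 2), codebook)
```

In this model the pair leaves the last processor after N + m clocks, and at any clock each processor touches only its own accumulator and the values passing through it, which is the locality property that makes the array systolic.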

For continuous data encoding, the next data vector with its
own pair of I and D follows right after the current data vector
so that the data are continuously pumped into the array. This
can be achieved by cycling the codevector components $C_i(j)$
into processor i as the input data flows into the array. Each
d(i) is reset after the vector distortion is determined.

For this systolic architecture having N processors and N
codevectors, each codevector having m components, the
encoding speed is increased by a factor of N over a single
processor architecture. The pipeline latency is N+m clock
cycles. The throughput rate is constant at 1 pixel/clock for
any vector dimension and codebook size. Since typically N
is chosen to be large to attain good reproduced image
quality, a large number of processors are required. Therefore,
in accordance with the present invention, by combination of
tree-searched VQ and systolic processing, a high
throughput VQ encoder can be realized with minimal
hardware.

Systolic Architecture for Tree-Searched VQ 

Equation (3) shows that the tree-searched VQ encoder is
in effect a series of the full-searched VQ encoders. The key
is to correctly address the next level subcodebook. This can
be realized by tagging the index of the current tree level l to
the indices of the previous tree levels 1, 2, . . . , l-1. The
combined indices are then used to address the next level
subcodebook l+1.

A systolic architecture for the tree-searched VQ is
essentially a concatenation of L systolic arrays of the
full-searched VQ, where L is the number of tree levels. Each
stage l corresponds to one tree level l. The codevectors of
each subcodebook are arranged as follows. Codevector
components $C_{i_1 i_2 \ldots i_l}(j)$ are allocated to processor
$i_l$ of the $l$th stage array. There are
$N_1 N_2 \ldots N_{l-1}\, m$ codevector components
in each processor of the $l$th stage array. During the VQ
encoding, the codevector components are addressed by the
combined indices of the previous stages, $i_1 i_2 \ldots i_{l-1}$.
For this pipeline architecture the $l$th stage contains $N_l$
processors, which in total is

$$\sum_{l=1}^{L} N_l$$

processors. The pipeline latency is

$$\left( \sum_{l=1}^{L} N_l \right) + Lm$$

clock cycles. The system throughput rate is 1 pixel/clock,
constant for any tree-structured codebook.

Systolic Architecture for Binary Tree-Searched Raw Codebook VQ

A systolic architecture for the raw codebook binary
tree-searched VQ (RCVQ) defined by Equation (6) is shown in
FIG. 4 where the blocks $d_l(0)$ and $d_l(1)$ are distortion
computation elements for implementing Equation (6); CP(0)
and CP(1) are elements for comparison of the distortion; and
buffer elements 1 delay the input data sufficiently to maintain
synchronization of the data flow through the pipeline of
distortion computation elements with the concatenated
indices used to address the next stage l+1 codebooks. The
preferred organization of each stage will be described more
fully in the next sections.

The input data sequence continuously flows into the array.
Note that at each stage the data vector is compared with two
codevectors in memory. After the index of the current tree
stage (level) is obtained, it is tagged to the indices of the
previous tree stages (levels) to address the next stage (level)
subcodebook. The index is attained at a rate of one bit per
stage. At the end of the array, the concatenated indices, n (= L)
bits in length, are formed to represent the coded data.
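
The one-bit-per-stage index formation can be sketched in software as follows. The memory layout is an assumption made for illustration: stage l is modeled as a list of 2^(l-1) codevector pairs addressed by the integer formed from the bits chosen at the previous stages. Squared error is assumed as the distortion, a tie selects bit 0, and all names and values are hypothetical.

```python
def squared_error(x, c):
    """Assumed distortion: squared error between equal-length vectors."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def btsvq_encode(x, stage_memories):
    """Return the n-bit coded index as an integer, one bit per stage."""
    index = 0
    for memory in stage_memories:      # stage l holds 2**(l-1) codevector pairs
        c0, c1 = memory[index]         # address = bits chosen by earlier stages
        bit = 0 if squared_error(x, c0) <= squared_error(x, c1) else 1
        index = (index << 1) | bit     # tag the new bit onto the previous ones
    return index


# Toy 2-stage binary tree (n = 2) with scalar vectors (m = 1).
stages = [
    [((2,), (8,))],
    [((1,), (3,)), ((7,), (9,))],
]
codes = [btsvq_encode((v,), stages) for v in (0, 3, 7, 9)]
```

Each stage compares the input with exactly two codevectors, so the n-bit index is obtained with 2n distortion evaluations instead of the 2^n required by an exhaustive search of the leaves.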

Since n=L for the binary tree-searched VQ, the overall
system requires 2n processors. The pipeline latency equals
n(2+m) clock cycles. The input data rate is 1 pixel per clock
cycle, and the output data rate is n bits per m clock cycles.
Therefore, the output data rate is effectively reduced by a
factor of Km/n, the compression ratio. This systolic
architecture of FIG. 4 only requires a small number of processors
compared to the full-searched VQ scheme. It has the
advantages of modularity, regular data flow, simple
interconnection, localized communication, simple global control, and
parallel/pipelined processing such that it is well suited for
VLSI implementation.

Preferred Design of Systolic Binary Tree-Searched Raw 
Codebook VQ 

An example of a preferred RCVQ design which lends
itself to VLSI implementation for EOS on-board SAR
applications is detailed in this section for a 10-bit codebook
of a 4x4 pixel vector dimension. This results in a maximum
compression ratio of 12.8:1. Limited flexibility in compression
ratio can be realized by varying the vector dimension.
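
The figures quoted for this design follow from the formulas given for the architecture of FIG. 4. The short calculation below is illustrative; the function name is hypothetical, and K = 8 bits per pixel is an assumption taken from the 8-bit image data ports listed for the processing elements.

```python
def btsvq_design_figures(n, m, K):
    """Evaluate the FIG. 4 formulas for an n-bit binary tree-searched codebook."""
    return {
        "compression_ratio": K * m / n,    # Km/n
        "processors": 2 * n,               # 2n distortion processors
        "latency_clocks": n * (2 + m),     # n(2+m) clock cycles
        "output_bits_per_vector": n,       # n bits every m clock cycles
    }


# 10-bit codebook, 4x4 pixel vectors, 8-bit pixels (K assumed).
figures = btsvq_design_figures(n=10, m=4 * 4, K=8)
```

With these inputs the ratio is 8 x 16 / 10 = 12.8, matching the stated maximum; a smaller vector dimension m lowers the ratio in proportion, which is the limited flexibility mentioned above.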

The mean square error criterion is chosen as the distortion
measure. FIG. 5 illustrates the major functional blocks of a
systolic binary tree-searched VQ encoder which are the
processing element (PE) array 20, the VQ codebook
memory banks 21 and an array controller 22, all of which are
under synchronized control of an EOS Control and Data
System (CDS) 23 as are a SAR processor 24 which presents
the serial pixels in digital form and a downlink packetizer 25
which forms packets of VQ data for transmission to a ground
station.


Detailed RCPE Design

The PE array 20 performs the distortion computation of 
the VQ algorithm. For a VQ encoder with an n-bit codebook, 
this can be realized by n identical PEs. FIG. 6 shows a
functional block diagram of a PE for an RCVQ. It is designed
to compute the mean square error distortion between an 
input data vector and each codevector pair. 

The distortion computing of the raw codebook processing
element (RCPE) design is primarily two mean square error
operations. During the VQ encoding, the codevector pair
components are addressed by the combined indices of the
previous PEs, $i_1 i_2 \ldots i_{l-1}$. An accumulator accumulates the
intermediate result as the codevector pair components $C_1$ and
$C_0$ move downward and the input data x(j) moves to the
right synchronously. After m clock cycles, the accumulator
will consecutively contain the mean square errors $d_1$ and $d_0$
between the input data vector x and the selected codevector
pairs.

The index generator compares the distortion measurements
$d_1$ and $d_0$. If $d_1 \ge d_0$ then $i_l = 0$, else $i_l = 1$. Index
$i_l$ is tagged to the indices of the previous tree levels to
correctly address the next level subcodebook. At the end of the
array, the concatenated indices, n bits in length, are formed to
represent the coded data.

The RCPEs are identical, designed to fit into a single chip 
using VLSI space-qualifiable 1.25 µm CMOS technology.
Assessment based on a detailed logic diagram and VLSI 
layout of the RCPE shows that the gate count is about 3,000 
and the pin count about 37, which is well within the 
capability of present VLSI technology. 

A detailed functional design of an RCPE is shown in FIG.
6. The pin names and definitions of the RCPE and associated
Memory Bank shown in FIGS. 7a and 7b are summarized in
the following table:


Signal        Type     Description

MEMORY BANK

CLK           Input    System clock
HA_EN         Input    To enable the pixel address generator
HALD          Input    To load the hierarchical vector address
CS (10:1)     Input    To enable the memory module #1 to #10
A (15:0)      Input    System address bus
H/A           Input    To select either system address or
                       hierarchical encoding address
R/W           Input    To select either memory read or memory write
OE            Input    Tri-state output control
D (15:0)      Input    System data bus
DCn (15:0)    Output   16-bit output port of subcodebook #n

PROCESSING ELEMENT

DC (15:0)     Input    Codevector pairs from subcodebook module
CLK           Input    System clock (at pixel rate)
DI (7:0)      Input    8-bit input image data
DO (7:0)      Output   8-bit 16-stage pipelined image data
Hn            Output   Index of vector generated at PE#n


Detailed Memory Bank Design 
The memory bank is composed of subcodebook memory
modules, each storing a VQ subcodebook. FIGS. 7a and 7b
show a detailed functional design of the memory bank 21 in
FIG. 5. For the binary tree-searched VQ, the n-bit codebook
is divided into n (= L for binary tree-searched VQ) hierarchical
levels. The codevectors in each level l are stored in their
corresponding memory module l. The size of the memory
module l is $2^l mK$ (= $2^{l+7}$) bits. The total size of the memory
bank is $(2^{n+1} - 2)\,mK$ (= $2^{n+8} - 2^8$) bits. Although the
modules of the memory bank differ in size, they assume a regular
structure in terms of memory cell design. To enable the
programmability of the codebook, the memory bank can be
accessed in both read and write modes by the host system 23
of FIG. 5 via the array controller 22 during the initialization.
During VQ encoding operation, each memory module can
only be accessed to read or write by its associated RCPE.
The total size in terms of the primitive memory cell for a
10-bit codebook is approximately $2^{18}$ bits.
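
The module and bank sizes can be verified with a few lines of arithmetic. The sketch below uses hypothetical function names and assumes m = 16 pixels and K = 8 bits per pixel, the values implied by the identity 2^l mK = 2^(l+7).

```python
def module_bits(level, m, K):
    """Size of memory module l: 2^l codevectors of m components of K bits."""
    return (2 ** level) * m * K


def bank_bits(n, m, K):
    """Total memory bank size: sum of the modules for levels 1 to n."""
    return sum(module_bits(level, m, K) for level in range(1, n + 1))


total = bank_bits(n=10, m=16, K=8)
closed_form = (2 ** 11 - 2) * 16 * 8      # (2^(n+1) - 2) mK with n = 10
```

Both expressions evaluate to 261,888 bits, that is 2^18 - 2^8, which is the figure rounded to 2^18 bits in the text.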

Systolic Architecture for Binary Tree-Searched Difference
Codebook VQ

FIGS. 8 and 9 show the architecture of the systolic array
for the difference-codebook BTSVQ. The input data vector
sequence continuously flows into the array. For
difference-codebook BTSVQ at each stage, the inner product between
input data vectors and the difference codevectors is
computed and compared with the second-order difference codewords.
After the index of the current tree level is obtained, it is
tagged to the indices of the previous tree levels to address
the next level subcodebook. The index is attained at a rate of
one bit per stage. At the end of the array, the concatenated
indices of n-bit length are formed and represent the coded
data of the corresponding input data vector.

The array controller 22 interprets control parameters from
the host system via the on-board SAR processor to set up
the BTSVQ encoder and provides status data for the host
system to do housekeeping. It also provides the interface
timing to upload/download the data among the VQPEs, SAR
processor 24 and downlink formatter 25. It also generates
timing and control signals to operate the VQPEs 22. The
array controller is implemented with a programmable logic
array (PLA) device and several data buffers. Due to the
localized data/control flow of systolic array processors, the
array controller logic is simple.

In this systolic difference codebook BTSVQ, each PE
corresponds to one of several binary tree-levels, such as ten
numbered 1 through 10 in the example to be described. The
major functional blocks of each VQPE 1, 2, . . . , n of a BTSVQ
shown in FIG. 8 are a subcodebook memory 26, distortion
computation data path 27 and index generator 32.

For the DCPE of a BTSVQ, an n-bit codebook is divided
and converted into n difference subcodebooks. The
first-order and second-order differences of each codevector pair
in level l are stored in the subcodebook as shown in FIG. 9.
The size of difference subcodebook memory of DCPE at
level l is $2^{l-1}\,[m(K+1) + (2K + \log m)]$ bits.

Referring to FIG. 9, the distortion computing datapath 27
of the DCPE design is primarily an inner product operator
which is much simpler than the distortion calculator of the
RCPE. During the VQ encoding, the difference-codevector
components are addressed by the combined indices of the
previous PEs, $i_1, i_2, \ldots, i_{l-1}$. An accumulator accumulates
the intermediate result as the difference-codevector component
$\delta(j)$ moves downward and the input data x(j) moves to the
right synchronously. After m clock cycles, the accumulator
will consecutively contain the inner product $\Delta'$ between the
input data vector x and the selected difference codevector.

The index generator compares the second-order difference
codeword $\Delta$ with the distortion measurement $\Delta'$. If
$\Delta \ge \Delta'$, then $i_l = 1$, else $i_l = 0$. Index $i_l$ is
tagged to the indices of the previous tree levels to correctly
address the next level subcodebook. At the end of the array, the
concatenated indices, n bits in length, are formed to represent
the coded data. The comparator-based index generator makes it
easy to perform error detection for PE. However, the
subtracter-based index generator has simpler hardware.

Preferred Design of Systolic Binary Tree-Searched Difference
Codebook VQ
To attain the light-weight, small-volume, and low-power
requirements, VLSI technology is preferred for implementation
of the DCPE of FIG. 9 as shown in FIG. 10. The
building blocks include a pipeline buffer 30, one ID register
31, multiplexers 32, 33 and 34, static RAM array 35,
complementor 36, multiplier array 37, carry save adder 38,
and comparator 39. The on-chip static RAM array 35
includes a 512x9 RAM and a 32x20 RAM which are used
to store the difference subcodebook up to level 6. For levels
from 7 to 10, an additional external subcodebook memory is
required for each level. An external memory interface is
represented by an input EXTCD(8:0) from external memory
to a multiplexer 33 enabled by an input EXTCDEN for
levels 7-10. This interface is built as part of each DCPE to
support a 10-level systolic BTSVQ encoder with a common
VLSI chip for each DCPE.
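
The stated on-chip limit of level 6 is consistent with the difference subcodebook size formula. The check below rests on assumptions not spelled out in the text: m = 16, K = 8, a base-2 logarithm, difference components held as (K+1)-bit words in the 512x9 RAM, and thresholds held as (2K + log m)-bit words in the 32x20 RAM. The function names are hypothetical.

```python
import math


def dcpe_subcodebook_shape(level, m=16, K=8):
    """Return (delta words, delta width, threshold words, threshold width)."""
    nodes = 2 ** (level - 1)                 # codevector pairs at this level
    delta_words = nodes * m                  # m difference components per pair
    delta_width = K + 1                      # signed difference of K-bit values
    threshold_words = nodes                  # one threshold per pair
    threshold_width = 2 * K + int(math.log2(m))
    return delta_words, delta_width, threshold_words, threshold_width


def fits_on_chip(level, m=16, K=8):
    """Assumed mapping onto the 512x9 and 32x20 on-chip RAMs."""
    dw, dwidth, tw, twidth = dcpe_subcodebook_shape(level, m, K)
    return dw <= 512 and dwidth <= 9 and tw <= 32 and twidth <= 20


on_chip_levels = [level for level in range(1, 11) if fits_on_chip(level)]
```

Under these assumptions level 6 fills both RAMs exactly (512 words of 9 bits and 32 words of 20 bits), and level 7 would need 1024 difference words, which agrees with the external memory requirement for levels 7 to 10.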

To enable the programmability, the difference subcodebook
memory 35 can be read out of and written into by the
host system via the controller 20 (FIG. 8) during the setup
mode. While in the encoding mode, each subcodebook
memory can only be read out of and written into by its
associated PE. In the setup mode, the first-order codevector
differences $\delta$ are stored into the subcodebook memory 35.
Meanwhile, the second-order codevector differences $\Delta$ are
entered and stored in a threshold register 40 of each PE. In
the encoding mode, the input vectors, DI(7:0), are received
from the on-board SAR processor 24 via the array controller
22.

The PE performs an inner product between the input
vectors and the codevector differences. The inner product is
stored in a register 41 and compared with the second-order
codevector differences $\Delta$ stored in the threshold register 40
at the rising edge of a vector clock VCLK. A one-bit index
is generated at level l and concatenated with index bits of
the previous PEs for lower levels to address the next level
l+1 subcodebook. The concatenated index bits of the last PE
thus formed represent the coded data for the input data
vector x. The pin names and definitions of the DCPE are
summarized in the following table:


Signal        Type    Description
------------  ------  ------------------------------------------------
VCLK          Input   Vector clock
PCLK1         Input   Pixel clock (phase 1)
PCLK2         Input   Pixel clock (phase 2)
AB (8:0)      Input   9-bit system address bus for subcodebook memory
D (19:0)      Input   20-bit system data bus for subcodebook memory
WRCD*         Input   Write enable of subcodebook (active low)
DI (7:0)      Input   8-bit input image data
DO (7:0)      Output  8-bit 16-stage pipelined image data
WRCSD*        Input   Write enable of threshold register
EXTCD (8:0)   Input   9-bit codeword from the external subcodebook memory
EXTCDEN       Input   To enable multiplexer to accept EXTCD (8:0)
AP (3:0)      Input   Address of pixel elements of vectors
IDP (8:0)     Input   9-bit concatenated indices from previous PEs
ID (9:0)      Output  10-bit concatenated indices
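
The per-level decision made by each DCPE (inner product of the input vector with δ, compared against the threshold Δ, then concatenation of the resulting bit with the earlier index bits) can be illustrated with a minimal software sketch. The function names, the list-based subcodebook layout, and the choice of bit 0 for the case where C0 is the closer codevector are assumptions made for illustration only; the hardware bit polarity is not stated here.

```python
def make_entry(c0, c1):
    """Build (delta, threshold) from a raw codevector pair (C0, C1)."""
    delta = [a - b for a, b in zip(c0, c1)]                     # first-order differences
    threshold = sum(a * a - b * b for a, b in zip(c0, c1)) / 2  # second-order difference
    return delta, threshold


def dcpe_level(x, delta, threshold):
    """One DCPE decision: return the index bit for a single tree level."""
    inner = sum(xj * dj for xj, dj in zip(x, delta))
    # [D(x,C0) - D(x,C1)]/2 = threshold - inner, so C0 is at least as close
    # as C1 when inner >= threshold. Mapping that case to bit 0 is an assumption.
    return 0 if inner >= threshold else 1


def btsvq_encode(x, subcodebooks):
    """Walk the binary tree; subcodebooks[level][prefix] = (delta, threshold)."""
    index = 0
    for level_table in subcodebooks:
        delta, threshold = level_table[index]
        bit = dcpe_level(x, delta, threshold)
        index = (index << 1) | bit  # concatenate with the earlier index bits
    return index
```

For a two-level tree built with make_entry, the value returned by btsvq_encode is the 2-bit concatenated index of the selected leaf, analogous to the ID output of the last PE.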


Fault Tolerance Design 

For a space mission, it is reasonable to assume a 5 to 10 
year unmaintained mission life with a processor reliability 
goal well above 0.95. A fault tolerant architecture is required 
to achieve these goals. By combination of architectural fault 
tolerance and inherent error detection capability, a highly 
reliable VQ encoder can be attained, such as by a pro- 
grammed diagnostic routine initiated by the control and data 
system which supervises the SAR processor, VQ compressor 
and downlink packetizer. When a fault is detected in any 
one PE, a “fault” signal is generated and associated with the 
PE suffering a fault. 

As shown in FIG. 11, the linear systolic array of the VQ 
encoder is augmented with a spare processing element SPE 
at the end of the array and dynamic reconfiguration switches 
(RS). Two switch designs, type RS-A and type RS-B, are 
presented to support the fault tolerance reconfiguration. If 
there is a permanent fault in any active PE, the faulty PE 
is detected and bypassed by a type RS-B switch at its 
output. Meanwhile, the spare processing element SPE at the 
end of the array is activated by type RS-A switches for 
all PEs downstream in the array. The spare processing 
element SPE is bypassed by a type RS-B switch at its output 
until called upon to serve. At that time, the VQ codebooks 
of the PEs are reassigned starting with the faulty PE, 
shifting each PE codebook to the next PE of the array in 
the direction from the input end to the output end of the 
PE array. The reconfiguration switches are controlled by a 
“fault” signal stored in an array register by the diagnostic 
subroutine system, which conducts the tests for detection 
of a faulty PE during the set-up time before encoding SAR 
data for transmission. 
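
The codebook shift described above can be sketched as a mapping from logical tree levels to physical PEs. This is an illustrative model only: the function name, the zero-based PE numbering, and the convention that the spare is the last physical PE are assumptions, and the sketch does not model the RS-A and RS-B switches themselves.

```python
def reconfigure(num_active, faulty=None):
    """Map logical tree levels to physical PEs in an array with one spare.

    Physical PEs are numbered 0..num_active; the last one is the spare SPE.
    Entry l of the returned list is the physical PE that holds the codebook
    for logical level l. `faulty` is a physical PE index, or None.
    """
    if faulty is None:
        return list(range(num_active))  # spare stays bypassed
    if not 0 <= faulty <= num_active:
        raise ValueError("faulty PE index out of range")
    # Levels upstream of the fault keep their PE; the faulty PE's level and
    # every level downstream shift one PE toward the output end.
    return [pe if pe < faulty else pe + 1 for pe in range(num_active)]
```

For a 10-level array, reconfigure(10, 3) leaves levels 0 to 2 in place and moves levels 3 to 9 onto physical PEs 4 to 10, so the faulty PE 3 no longer holds a codebook and the spare takes the last level.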

In detecting a fault, a single computation unit (such as 
multiplier or adder) fault model may be used, in which it is 
assumed that at most one PE could suffer a fault within a 
given period of time that is reasonably short compared 
with the mean time between failures. Since effective 
error detecting and correcting schemes, such as parity and 
Hamming codes, exist for communication lines and memories, 
failures in these parts can be readily detected and 
corrected by those methods. The fault model concentrates on 
the permanent failures of a PE. 

Two basic mechanisms can be applied to detecting faults 
in this type of system: on-line concurrent error detection and 
periodic self-test. On-line single error correction for 
arithmetic operations can be accomplished by arithmetic codes 
such as AN code or Residue code. For the EOS SAR 
processor, temporary distortion of images due to transient 
faults may be tolerable. Hence a second error, if any, can be 
detected by periodic self-test, which is performed during 
power-up and periodically during operation by temporarily 
halting compression of data. For the dual data path (RCPE) 
implementation, each PE is tested by applying the same 
input data and codevector to both its paths and using the 
comparator to determine whether the two results are equal. 
If they are not equal, a permanent or a transient fault may 
exist in the PE. To determine whether it is a transient fault 
or a permanent fault, the same input and codevector are 
reapplied following the first detection of error. If the two 
data paths still generate different results, a permanent fault 
has been detected and reconfiguration is needed to avoid 
the faulty PE. 

For the DCPE design, predetermined test inputs are 
applied since there is only one data path, and precomputed 
results corresponding to the inputs need to be stored. The 
comparator then compares the generated results with the 
stored values. If the two are the same, the PE is fault-free; 
otherwise, the same input is reapplied to find out whether it 
is a permanent or transient fault. Following the location of 
the faulty PE, the spare PE is switched in to maintain the size 
of the PE array. 
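
The test-then-retest procedure used to separate transient from permanent faults can be sketched as follows. The names classify_fault and find_faulty_pe, and the use of a callable to stand in for applying a test input to a PE, are hypothetical. The reference value is the result of the second data path for an RCPE, or the precomputed stored result for a DCPE.

```python
def classify_fault(run_test, expected):
    """Return 'ok', 'transient', or 'permanent' for one PE.

    run_test -- callable that applies the test input and returns the PE result
    expected -- reference result to compare against
    """
    if run_test() == expected:
        return "ok"
    # First mismatch: reapply the same input to separate the two cases.
    if run_test() == expected:
        return "transient"
    return "permanent"


def find_faulty_pe(tests):
    """tests is a list of (run_test, expected) pairs, one per PE.

    Returns the index of the first PE with a permanent fault, or None.
    Stopping at the first hit relies on the single-fault model.
    """
    for pe, (run_test, expected) in enumerate(tests):
        if classify_fault(run_test, expected) == "permanent":
            return pe
    return None
```

The index returned by find_faulty_pe is the value that a diagnostic routine would record as the “fault” signal before the spare PE is switched in.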

The hardware overhead of the self-test and reconfiguration 
scheme is about 20%. At the PE level, the overhead hardware 
includes two reconfiguration switches, one multiplexer, 
two registers, two comparators, one flag register, one 
n-input OR gate, one control line, n input lines, and one 
output line. At the PE array level, only one spare PE is required. 
It has been shown that error correction using arithmetic code 
is also cost effective. The encoding introduces redundant bits 
in the number representation. A proportional hardware 
increase takes place in register array and data path. The 
estimated hardware overhead is from 20% to 40%, which 
should be able to fit in the PE chip of available die size 300 
mils × 300 mils. 

The reliability improvement can be addressed as follows: 
If each PE has a reliability of R, then the reliability of 10 PEs 
is $R^{10}$. For the reconfigurable array with one spare PE, the 
array works if at least 10 of the 11 PEs work, so the 
reliability becomes $R^{11} + 11R^{10}(1-R)$. For example, if 
R = 0.95, the reliability of the nonredundant PE array is 0.60 
while the reliability of the redundant array is 0.90. This 
represents a 50% increase in reliability. 
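
The figures above can be checked numerically. The sketch below assumes independent PE failures and ideal reconfiguration switches; the function name and its parameters are illustrative only.

```python
from math import comb


def array_reliability(r, active=10, spares=1):
    """Probability that at least `active` of `active + spares` PEs work.

    Assumes each PE works independently with probability r and that the
    reconfiguration switches themselves never fail.
    """
    total = active + spares
    return sum(
        comb(total, k) * r**k * (1 - r) ** (total - k)
        for k in range(active, total + 1)
    )
```

With r = 0.95, array_reliability(0.95, spares=0) is about 0.599 and array_reliability(0.95) is about 0.898, which round to the 0.60 and 0.90 quoted above.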

Conclusion 

Although particular embodiments of the invention have 
been described and illustrated herein, it is recognized that 
modifications and variations may readily occur to those 
skilled in the art. Consequently, it is intended that the claims 
be interpreted to cover such modifications and variations. 

We claim: 

1. In a systolic-array image processing system, a full-searched 
vector quantizer for data compression comprising 
an array of N processors, with N distortion parameters, d(i), 
one for each processor, and N codevectors stored in a 
codebook, where each stored codevector $C_i$ comprises m 
components $C_i(0), \ldots, C_i(m-1)$, said array of N processors 
processing input image data vectors and codevectors to 
generate said distortion parameters as a weighted sum of 
scalar distortion in accordance with the following equation: 


$$d(i) = D(x, C_i) = \sum_{j=0}^{m-1} w(j)\,\lvert x(j) - C_i(j)\rvert^{2}$$

where d(i) is the distortion between an input vector x and a 
stored codevector which is the i-th codevector of the 
codebook, and $D(x, C_i)$ is said distortion parameter as a 
function of said input vector x and said stored codevector $C_i$ 
of the i-th processor for $0 \le i \le N-1$ and $0 \le j \le m-1$, where x(j) 
represents the j-th component of the input data vector x, $C_i(j)$ 
is the j-th component of the i-th codevector $C_i$, w(j) being the 
weighting factor in the distortion measure, and wherein an 
index, i, of said codevector $C_i$ of minimum distortion, 
$i = \min^{-1} d(i)$, represents a vector quantization coded data of 
an input vector, where said index corresponds to the i-th 
processor, $0 \le i \le N-1$. 

2. In a systolic-array image processing system, a tree-searched 
vector quantizer for image data compression comprising 
a series of L systolic arrays of $N_l$ identical processors, 
and a plurality L of levels of subcodebooks, where l is 
the tree level index from 1 to L, one level of subcodebooks 
for each of L systolic arrays, and means for successively 
comparing an input vector sequence $x^{[k]}$ with stored 
codevectors in subcodebook levels in search for an output coded 
data sequence $i^{[k]}$ of minimum distortion in accordance with 
the following equations: 

$$i_1^{[k]} = \min^{-1}_{0 \le i_1 \le N_1 - 1} D\left(x^{[k]}, C_{i_1}\right)$$

$$i_2^{[k]} = \min^{-1}_{0 \le i_2 \le N_2 - 1} D\left(x^{[k]}, C_{i_1^{[k]} i_2}\right)$$

$$\vdots$$

$$i_L^{[k]} = \min^{-1}_{0 \le i_L \le N_L - 1} D\left(x^{[k]}, C_{i_1^{[k]} i_2^{[k]} \ldots i_L}\right)$$

where $x^{[k]}$ is an input data vector sequence, k represents the 
time index of said sequence, and the codevector notation is: 
$C_{i_1}$ for level 1; $C_{i_1 i_2}$ for level 2; and so forth with 
$C_{i_1 i_2 \ldots i_L}$ for level L, and $i_1$ is a codevector index for tree 
level-1 subcodebook, $i_1 i_2 \ldots i_L$ is a codevector index for tree 
level-L subcodebook, $D(x^{[k]}, C_{\ldots})$ is a distortion function, 
and $i_l^{[k]}$ is a vector quantization coded output data sequence 
of said input vector sequence for binary tree level-l encoding 
process, $N_l - 1$ is the maximum value of said vector coded 
output data for tree level l, $1 \le l \le L$. 

3. In a systolic-array image processing system, a binary 
tree-searched raw codebook vector quantizer for image data 
compression comprising a series of n systolic arrays of two 
identical processors and a plurality n of levels of subcodebooks, 
one level of subcodebooks for each systolic array for 
successively comparing an input vector sequence $x^{[k]}$ with 
selected codevector pairs in subcodebook levels in accordance 
with the following equations: 

$$i_1^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1}\right), \quad i_1 = 0, 1$$

$$i_2^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1 i_2}\right), \quad i_2 = 0, 1$$

$$\vdots$$

$$i_n^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1 i_2 \ldots i_n}\right), \quad i_n = 0, 1$$

where $x^{[k]}$ is an input vector sequence and k represents the 
time index of said sequence, and the codevector notation is: 
$C_{i_1}$ for binary level 1, $C_{i_1 i_2}$ for binary level 2, and so forth 
with $C_{i_1 i_2 \ldots i_n}$ for level n, $i_1$ is codevector index for binary 
tree level-1 subcodebook, $i_1 i_2$ is codevector index for binary 
tree level-2 subcodebook, and so forth $i_1 i_2 \ldots i_n$ is 
codevector index for binary tree level-n subcodebook, n is 
the level number of the deepest binary tree, $D(x^{[k]}, C_{\ldots})$ is 
a distortion function, $i_l^{[k]}$ is a vector quantization coded 
output data sequence of said input vector sequence for 
binary tree level-l encoding process and l is a binary tree 
level index from 1 to n, $i^{[k]}$ is a vector quantization coded 
output data sequence of said input vector sequence for 
overall encoding process. 

4. In a systolic-array image processing system, a binary 
tree-searched difference codebook vector quantizer for 
image data compression comprising a series of systolic 
arrays of identical processors and a plurality of levels of 
subcodebooks, one level of subcodebooks for each systolic 
array for successively comparing an input vector with 
selected codevector pair difference in subcodebook levels in 
accordance with the following equations: 

$$i_1^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1}\right), \quad i_1 = 0, 1$$

$$i_2^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1 i_2}\right), \quad i_2 = 0, 1$$

$$\vdots$$

$$i_n^{[k]} = \min^{-1} D\left(x^{[k]}, C_{i_1 i_2 \ldots i_n}\right), \quad i_n = 0, 1$$

where $x^{[k]}$ is an input vector sequence and k represents the 
time index of said sequence, and the codevector notation is: 
$C_{i_1}$ for binary level 1, $C_{i_1 i_2}$ for binary level 2, and so forth 
with $C_{i_1 i_2 \ldots i_n}$ for level n, $i_1$ is codevector index for binary 
tree level-1 subcodebook, $i_1 i_2$ is codevector index for binary 
tree level-2 subcodebook, and so forth $i_1 i_2 \ldots i_n$ is 
codevector index for binary tree level-n subcodebook, n is 
the level number of the deepest binary tree, $D(x^{[k]}, C_{\ldots})$ is 
a distortion function, $i_l^{[k]}$ is a vector quantization coded 
output data sequence of said input vector sequence for 
binary tree level-l encoding process, $i^{[k]}$ is a vector 
quantization coded output data sequence of said input vector 
sequence for overall encoding process, wherein said distortion 
function between an input vector $x^{[k]}$ and codevectors at 
the same binary tree level $C_0$ and $C_1$ are 

$$D\left(x^{[k]}, C_0\right) = \sum_{j=0}^{m-1} \left( x^{[k]}(j)^2 + C_0(j)^2 \right) - 2 \sum_{j=0}^{m-1} x^{[k]}(j)\, C_0(j)$$

$$D\left(x^{[k]}, C_1\right) = \sum_{j=0}^{m-1} \left( x^{[k]}(j)^2 + C_1(j)^2 \right) - 2 \sum_{j=0}^{m-1} x^{[k]}(j)\, C_1(j)$$

$x^{[k]}(j)$ being the j-th component of the input vector, j being the 
component index of the input vector, m being the number of 
components of the input vector, $C_i(j)$ being the j-th component 
of the i-th codevector, i being the codevector index of the 
codebook, $C_0$ and $C_1$ being the codevector pair of the 
subcodebook in the same binary tree level, where codebook 
memory size is $(2^{n+1}-2)nK$ bits, n is a maximum number of 
tree levels, K is a number of bits per pixel, said distortion 
computation between input vector $x^{[k]}$ and codevectors at 
the same binary tree level $C_0$ and $C_1$ is simplified as follows: 

$$\frac{D(x, C_0) - D(x, C_1)}{2} = \sum_{j=0}^{m-1} \frac{C_0(j)^2 - C_1(j)^2}{2} - \sum_{j=0}^{m-1} x^{[k]}(j) \left[ C_0(j) - C_1(j) \right] = \Delta - \sum_{j=0}^{m-1} x^{[k]}(j)\, \delta(j)$$

and instead of saving $C_0(j)$ and $C_1(j)$, the terms, 

$$\Delta = \sum_{j=0}^{m-1} \frac{C_0(j)^2 - C_1(j)^2}{2}$$

and $\delta(j) = C_0(j) - C_1(j)$ are stored in said codebook, where 
codebook memory size is $(2^n - 1)\,[m(K+1) + (2K + \log m)]$ 
bits. 


* * * * * 



